Like most people these days, I am stuck at home due to the coronavirus, and I feel nostalgic for the time when I could freely run in nature and public parks. To remember those good old days, I decided to create a visualization of the runs I used to do in the “Bois de Vincennes” park when I lived in Paris. The idea is to see where I ran the most: a heatmap will be perfect for that.
For this little project, I am going to use R and the indispensable ggplot2 library combined with ggmap. I will also give the leaflet for R library a try, as I have never used it before.
library(tidyverse)
library(xml2)
library(tools)
library(ggmap)
library(leaflet)
library(sp)
library(rgdal)
library(KernSmooth)
I am retrieving my data from the Strava website: all of my runs are recorded on my Strava account. This is pretty handy because Strava lets you bulk download all of your files: have a look at this help page.
The zip file that you download contains a lot of files, but only some of them will interest us: the gpx files, which store the GPS track of each activity. Here is an excerpt of one:
<trkseg>
<trkpt lat="48.8496940" lon="2.4355100">
<ele>61.2</ele>
<time>2016-02-05T17:36:12Z</time>
</trkpt>
<trkpt lat="48.8498080" lon="2.4354640">
As you may have guessed, the next step is to convert these files into a single tidy dataframe. This is the purpose of the next section.
The first thing we want to do before parsing all of the files is to keep only those that contain run coordinates. The file activities.csv is perfect for that; let's load and explore it.
df_activities <- read_csv("../data/activities.csv")
str(df_activities)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 106 obs. of 10 variables:
## $ id : num 4.60e+08 4.67e+08 4.69e+08 4.71e+08 4.72e+08 ...
## $ date : POSIXct, format: "2016-01-01 10:26:36" "2016-01-10 08:40:06" ...
## $ name : chr "Course à pied du midi" "Course à pied matinale" "Course à pied en soirée" "Course à pied matinale" ...
## $ type : chr "Run" "Run" "Run" "Run" ...
## $ description : logi NA NA NA NA NA NA ...
## $ elapsed_time: num 2736 3835 1594 4567 2473 ...
## $ distance : num 8343 7174 5213 12968 6799 ...
## $ commute : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ gear : logi NA NA NA NA NA NA ...
## $ filename : chr "activities/460488679.gpx" "activities/467047992.gpx" "activities/468881650.gpx" "activities/471346941.gpx" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. date = col_datetime(format = ""),
## .. name = col_character(),
## .. type = col_character(),
## .. description = col_logical(),
## .. elapsed_time = col_double(),
## .. distance = col_double(),
## .. commute = col_logical(),
## .. gear = col_logical(),
## .. filename = col_character()
## .. )
What a convenient file! All of the activities are logged and we immediately know the type of each one thanks to the “type” column. Even better, the relative path to the gpx file is given for each activity.
Let's filter this data to keep only the run activities. Since most of the columns are irrelevant to our analysis, we will drop them as well.
df_activities <- df_activities %>%
filter(type == "Run") %>%
select(id, date, type, elapsed_time, distance, filename)
head(df_activities)
## # A tibble: 6 x 6
## id date type elapsed_time distance filename
## <dbl> <dttm> <chr> <dbl> <dbl> <chr>
## 1 4.60e8 2016-01-01 10:26:36 Run 2736 8343. activities/4604~
## 2 4.67e8 2016-01-10 08:40:06 Run 3835 7174. activities/4670~
## 3 4.69e8 2016-01-12 17:42:57 Run 1594 5213. activities/4688~
## 4 4.71e8 2016-01-16 09:39:53 Run 4567 12968. activities/4713~
## 5 4.72e8 2016-01-17 13:23:44 Run 2473 6799. activities/4723~
## 6 4.74e8 2016-01-19 17:32:13 Run 3343 11238. activities/4738~
That’s better. Now it is time to have a look at a gpx file. As we saw, it is an XML file, so it will be easy to parse with the xml2 library and the XPath query language.
Let’s try that first with a single file.
gpx_file <- "../data/activities/486549502.gpx"
test_file <- read_html(gpx_file)
trkpt_nodes <- xml_find_all(test_file, "//trkpt") # use xpath to parse all trkpt nodes
# collect GPS coordinates (lon / lat) within these nodes
lon <- as.numeric(xml_attr(trkpt_nodes, attr = "lon"))
lat <- as.numeric(xml_attr(trkpt_nodes, attr = "lat"))
# create a dataframe with the run id and the GPS coordinates
id <- gpx_file %>%
strsplit("/") %>%
unlist() %>%
tail(1) %>%
file_path_sans_ext() %>%
as.numeric()
df_coordinates <- data.frame("id" = id, lon, lat)
head(df_coordinates)
## id lon lat
## 1 486549502 2.435510 48.84969
## 2 486549502 2.435464 48.84981
## 3 486549502 2.435476 48.84975
## 4 486549502 2.435366 48.84969
## 5 486549502 2.435194 48.84973
## 6 486549502 2.435159 48.84972
And voilà! That was pretty simple thanks to XPath. We could have retrieved other fields such as elevation or time, but they are of no use for this visualization. The id column will help us merge this table with our “activities” dataframe.
It is time to replicate what we just did for the other files. The code has already been written; we just need to wrap it in a function and process all the files listed in the “activities” dataframe.
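As a side note, the strsplit() pipeline that extracts the id can be written more tersely with basename(), which drops the directory part in a single call. This is just a sketch, assuming the Strava convention that each file is named "&lt;id&gt;.gpx":

```r
library(tools)

# same result as the strsplit() pipeline above, in a single call chain:
# basename() drops the directory part, file_path_sans_ext() drops ".gpx"
id_from_path <- function(path) as.numeric(file_path_sans_ext(basename(path)))

id_from_path("../data/activities/486549502.gpx")  # 486549502
```

Both versions behave the same; I simply find the strsplit() pipeline easier to read step by step.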
# this function retrieves the gps coordinates of a gpx file
gpx_coordinates_retriever <- function(gpx_file){
x <- read_html(gpx_file)
trkpt_nodes <- xml_find_all(x, "//trkpt") # use xpath to parse all trkpt nodes
# collect GPS coordinates (lon / lat) within these nodes
lon <- as.numeric(xml_attr(trkpt_nodes, attr = "lon"))
lat <- as.numeric(xml_attr(trkpt_nodes, attr = "lat"))
# create a dataframe with the run id and the GPS coordinates
id <- gpx_file %>%
strsplit("/") %>%
unlist() %>%
tail(1) %>%
file_path_sans_ext() %>%
as.numeric()
df_coordinates <- data.frame("id" = id, lon, lat, stringsAsFactors = FALSE)
df_coordinates
}
files_path <- paste0("../data/", df_activities$filename) # build valid paths to the gpx files
list_coordinates <- lapply(files_path, gpx_coordinates_retriever) # apply the function to all files
df_total_coordinates <- do.call("bind_rows", list_coordinates) # concatenate all dataframe in a single one
# join our coordinates dataframe with the activities one
df_activities_coordinates <- left_join(df_activities, df_total_coordinates, by = "id")
head(df_activities_coordinates)
## # A tibble: 6 x 8
## id date type elapsed_time distance filename lon
## <dbl> <dttm> <chr> <dbl> <dbl> <chr> <dbl>
## 1 4.60e8 2016-01-01 10:26:36 Run 2736 8343. activit~ -0.646
## 2 4.60e8 2016-01-01 10:26:36 Run 2736 8343. activit~ -0.646
## 3 4.60e8 2016-01-01 10:26:36 Run 2736 8343. activit~ -0.646
## 4 4.60e8 2016-01-01 10:26:36 Run 2736 8343. activit~ -0.646
## 5 4.60e8 2016-01-01 10:26:36 Run 2736 8343. activit~ -0.645
## 6 4.60e8 2016-01-01 10:26:36 Run 2736 8343. activit~ -0.645
## # ... with 1 more variable: lat <dbl>
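As a side note, the lapply() + do.call() pair used above could also be collapsed into one step with purrr (shipped with the tidyverse). Here is a sketch with a toy retriever, since the real gpx_coordinates_retriever() needs the gpx files on disk:

```r
library(purrr)

# a toy stand-in for gpx_coordinates_retriever(): like the real function,
# it returns one small dataframe per input "file"
toy_retriever <- function(f) data.frame(id = f, lon = 2.4355, lat = 48.8497)

# map_dfr() applies the function and row-binds all the results in one step
df_toy <- map_dfr(c("486549502", "460488679"), toy_retriever)
nrow(df_toy)  # 2
```

With the real data you would simply write map_dfr(files_path, gpx_coordinates_retriever).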
This looks great, but we’re not done yet: to speed up the creation of the charts, and as I’m only interested in my runs in the Bois de Vincennes park, I will filter out GPS coordinates that are clearly outside this area.
A quick look at OpenStreetMap shows that the Bois de Vincennes park spans roughly 48.8125 to 48.8505 in latitude and 2.3838 to 2.4856 in longitude.
# bounding box of the Bois de Vincennes park
lat_min <- 48.8125
lat_max <- 48.8505
lon_min <- 2.3838
lon_max <- 2.4856
df_activities_coordinates <- df_activities_coordinates %>%
filter(lat >= lat_min & lat <= lat_max ) %>%
filter(lon >= lon_min & lon <= lon_max )
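The same bounding-box filter can also be expressed with dplyr::between(), which some find more readable. A minimal self-contained sketch with two fake points:

```r
library(dplyr)

# two fake points: the first inside the park's bounding box, the second outside
pts <- data.frame(lat = c(48.8300, 48.9000), lon = c(2.4300, 2.3000))

inside <- pts %>%
  filter(between(lat, 48.8125, 48.8505),
         between(lon, 2.3838, 2.4856))
nrow(inside)  # 1
```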
Let’s start with a simple plot: we are going to draw every coordinate as a dot. We need a background map first, though; that’s the role of the ggmap library, which gets along very well with ggplot2.
This library lets you use many different sources for your base map, from Google Maps to OpenStreetMap. However, their usage has been restricted quite a lot lately: OpenStreetMap is no longer supported at the moment (see this GitHub issue) and you now need to register with Google to use their API, even if it remains “mostly” free for limited usage. I am using Stamen as it does not require an API key and is based on open-source data.
map <- get_map(location=c(left = lon_min,
bottom = lat_min,
right = lon_max,
top = lat_max),
zoom = 14,
source="stamen",
maptype = "watercolor")
mapPoints <- ggmap(map)
We now have our base map. All we need to do is plot the dots. This is an easy step: as mentioned, ggmap gets along very well with ggplot. And as it is pretty straightforward to build a scatterplot with ggplot, we can quickly generate our first draft.
Note that I set the alpha parameter (the transparency) of the dots to 0.01 (on a scale from 0 to 1) to get a first idea of the most frequented areas.
mapPoints +
geom_point(data = df_activities_coordinates,
aes(x = lon, y = lat),
alpha = 0.01)
This is not too bad for a first draft! I really love ggplot for its simplicity and efficiency: only a few lines of code to get a decent chart.
Now, to move to the next level, I am going to generate a density chart: the “busiest” areas will be colored red while the places I visited less frequently will be blue. To better see the colors, I will use another Stamen base map: the toner map, which is basically a black-and-white version. I add some titles, fine-tune a few visual parameters, and this chart should be good to publish (at least on this blog)!
mapBW <- get_map(location=c(left = lon_min,
bottom = lat_min,
right = lon_max,
top = lat_max),
zoom = 14,
source="stamen",
maptype = "toner")
mapPointsBW <- ggmap(mapBW)
mapPointsBW +
stat_density_2d(data = df_activities_coordinates,
aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), geom = "polygon")+
scale_alpha(range = c(0.3, 1))+
scale_fill_gradient(low = "blue", high = "red") +
labs(title = "Runs in Bois de Vincennes",
subtitle = "Distribution of GPS coordinates",
caption = "Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.")
We can clearly see on this map that I was particularly active in the north of the park; in particular, there is an “intensive” path that leads to the place I used to live (outside of the map). On the south-west side of the map, a reddish area is visible: this is a hill that I particularly enjoyed climbing to get a panoramic view of the park. But the most obvious thing we can deduce from this map is that I missed quite a big part of the park during my runs. In my defense, it is a pretty big park: three times the size of Central Park!
Static maps are nice, but interactive maps are even better! When possible, I always prefer to create interactive visualizations to let the audience “play” with the data.
I generally rely on Folium, a great Python library based on Leaflet, which is… a JavaScript library for creating interactive maps for the web. I had never tried to do the same with R, so I decided to give it a try. Unfortunately, there is no true equivalent of Folium in the R ecosystem (yet?). I have tested many libraries, but none of them has all the capabilities found in Folium, and most of them are not actively maintained.
RStudio, the company behind the famous IDE, has created the leaflet for R library. It is very effective for creating basic interactive maps but cannot create heatmaps from scratch. Fortunately, with the help of other packages, it is possible. I found a particularly interesting thread on StackExchange that shows an elegant solution to this problem. I tweaked it a bit (got rid of data.table, changed the bandwidth) to fit my needs, and here is our beautiful interactive map:
# Make Contour lines - bandwidth choice is based on MASS::bandwidth.nrd()
kde <- bkde2D(df_activities_coordinates %>% select(lon, lat) %>% as.data.frame(),
bandwidth=c(.0009, .0014), gridsize = c(200,100))
CL <- contourLines(kde$x1 , kde$x2 , kde$fhat)
# Extract Contour Lines level
LEVS <- as.factor(sapply(CL, `[[`, "level"))
NLEV <- length(levels(LEVS))
# Convert Contour Lines to Polygons
pgons <- lapply(1:length(CL), function(i)
Polygons(list(Polygon(cbind(CL[[i]]$x, CL[[i]]$y))), ID=i))
spgons = SpatialPolygons(pgons)
# Leaflet map with polygons
leaflet(spgons) %>% addTiles() %>%
addPolygons(color = heat.colors(NLEV, NULL)[LEVS])
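For completeness, the leaflet.extras package offers addHeatmap(), which skips the contour-polygon step entirely by rendering the heatmap client-side. I have not compared its output against the bkde2D approach above; this is only a sketch with dummy points (with the real data you would pass df_activities_coordinates and the same lng/lat mappings):

```r
library(leaflet)
library(leaflet.extras)

# dummy points standing in for the run coordinates
pts <- data.frame(lon = c(2.43, 2.44), lat = c(48.84, 48.85))

m <- leaflet(pts) %>%
  addTiles() %>%
  addHeatmap(lng = ~lon, lat = ~lat, radius = 8, blur = 15)
# m renders an interactive heatmap when printed in RStudio or a browser
```

The trade-off is less control: bkde2D lets you pick the kernel bandwidth explicitly, while addHeatmap() only exposes pixel-level radius and blur settings.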
Sadly, the main insight is the same: I need to go back to Paris and run a lot to complete my exploration of the Bois de Vincennes!
In conclusion, R provides some nice tools for creating maps. I am a particular fan of the combination of ggplot2 and ggmap, as they let you quickly build ready-to-publish charts with only a few lines of code. I used this a lot when I was working for a transport reseller: a map lets you quickly grasp spatial information.
I am less enthusiastic about making interactive maps with R. Maybe it is because I am used to doing it with Python (and Folium), but I found it a bit more tedious in R, as there is no single package designed for this kind of visualization. Still, leaflet for R remains a good solution for building basic interactive maps.